Introduction¶
This project is a college basketball simulator. Statistics from every possession played in the 2024 NCAA men's college basketball season are used to calculate probabilities for each team, and those probabilities drive a possession-by-possession simulation. The work is split into the following parts:
- Data Collection
- Data Processing
- Statistical Analysis
- Simulating Games
The data collection step consists of reading in information from play-by-play data. This is necessary because the simulation attempts to simulate a game by estimating the result of each possession between two teams. In order to get accurate data for each team's possession results, play-by-play data is parsed.
The data processing step takes the collected data and prepares it for the statistical analysis step. The raw data is interesting to look at, but it is not yet in a form a model can use. It consists of season-level data, which shows how teams compare over the whole season, and game-level data, which shows how two teams compared in a head-to-head matchup. The data processing step mainly uses the game-level data. It picks an arbitrary date around halfway into the season and sums each team's data before that point. For every game after that date, it updates the running sum of previous game stats and also records the stats from the current game. The data is then reshaped so that, instead of listing how many times each possession result occurred in a game, the dataset contains one row per possession. This makes the result categorical, so it can be analyzed more easily in the next step.
The statistical analysis step is relatively simple. A multinomial logistic regression is performed on the data. The regression takes the offense's and the defense's previous possession-result rates as inputs and estimates the probability of every possession result. This is done with scikit-learn's LogisticRegression class.
Once the statistical analysis is complete, the resulting model can take two teams' possession probabilities and produce expected results for their possessions. These probabilities are fed into a simulation that plays out a game possession by possession. Many games are simulated, and both the individual results and the average score are reported.
Data collection¶
For the data collection step, a number of functions are used to handle the different possibilities of a possession. The possibilities that pertain to the possession count consist of shots, rebounds, turnovers, and fouls. Free throws are also looked at to help simulate scores more accurately.
pandas is used for much of the work.
import pandas as pd
# get rid of the max display columns so it is always possible to see all team statistics
pd.set_option('display.max_columns', None)
Handle Game¶
The handle game function is used to find all of the useful statistics necessary for simulating a game. It does this by reading through the play-by-play data for an individual game, and determining what kind of play each row is describing. If the kind of play is relevant to the simulation statistics, helper functions are used to break down the contents of that play.
def handle_game(group_data, home_team_id, away_team_id, home_team_poss_res, away_team_poss_res):
    # list holding the dictionaries that contain each team's stats
    stats = [home_team_poss_res, away_team_poss_res]
    for play in group_data.itertuples():
        play_type = play.type_text
        if "Shot" in play_type and play_type != "Block Shot":
            stats = handle_shot(play, home_team_id, away_team_id, home_team_poss_res, away_team_poss_res)
        elif "Rebound" in play_type:
            stats = handle_rebound(play, home_team_id, away_team_id, home_team_poss_res, away_team_poss_res)
        elif "Turnover" in play_type:
            stats = handle_turnover(play, home_team_id, away_team_id, home_team_poss_res, away_team_poss_res)
        elif "Foul" in play_type:
            stats = handle_foul(play, home_team_id, away_team_id, home_team_poss_res, away_team_poss_res)
        elif "FreeThrow" in play_type:
            stats = handle_free_throw(play, home_team_id, away_team_id, home_team_poss_res, away_team_poss_res)
    return [stats[0], stats[1]]
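The order of the checks in handle_game matters, because a blocked shot contains the word "Shot" but should not be counted as a shot attempt. The sketch below isolates just the dispatch logic so it can be tried on a few strings. The function name classify_play and the example type_text values are assumptions for illustration; the real play-by-play file may use different labels.

```python
def classify_play(play_type: str) -> str:
    """Return which handler a play type would be routed to (illustrative helper)."""
    if "Shot" in play_type and play_type != "Block Shot":
        return "shot"
    elif "Rebound" in play_type:
        return "rebound"
    elif "Turnover" in play_type:
        return "turnover"
    elif "Foul" in play_type:
        return "foul"
    elif "FreeThrow" in play_type:
        return "free_throw"
    # anything else (substitutions, timeouts, blocks, ...) is ignored
    return "ignored"


# hypothetical type_text values, used only to exercise each branch
for text in ["JumpShot", "Block Shot", "Defensive Rebound", "MadeFreeThrow", "Substitution"]:
    print(text, "->", classify_play(text))
```

Plays that match none of the checks simply fall through, which is why handle_game needs no final else branch.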
Handle shot¶
The handle shot function is used to determine what happens after a shot in the game. It works as follows:
- Determine whether the shot was made. If it was, determine whether it was a two-point or a three-point shot and adjust the team statistics accordingly: record the kind of shot made and increment the team's possession count.
- If the shot was missed, increment the team's missed field goal count.
def handle_shot(play, home_team_id, away_team_id, home_team_poss_res, away_team_poss_res):
    if play.scoring_play:
        if play.score_value == 3:
            if play.team_id == home_team_id:
                home_team_poss_res['thr_fgm'] += 1
                home_team_poss_res['poss'] += 1
            elif play.team_id == away_team_id:
                away_team_poss_res['thr_fgm'] += 1
                away_team_poss_res['poss'] += 1
        elif play.score_value == 2:
            if play.team_id == home_team_id:
                home_team_poss_res['two_fgm'] += 1
                home_team_poss_res['poss'] += 1
            elif play.team_id == away_team_id:
                away_team_poss_res['two_fgm'] += 1
                away_team_poss_res['poss'] += 1
    # for plays that are not scoring plays
    else:
        if play.team_id == home_team_id:
            home_team_poss_res['fg_miss'] += 1
        if play.team_id == away_team_id:
            away_team_poss_res['fg_miss'] += 1
    return [home_team_poss_res, away_team_poss_res]
Handle Rebound¶
The handle rebound function is used to determine the results of a rebound.
The first check is whether the rebound was offensive or defensive. For an offensive rebound, the team that got the rebound is determined, and that team's offensive rebound count and possession count are incremented. This is not the standard method for counting possessions, but it lets an offensive rebound be treated as its own possession result, so that the probabilities calculated for each result reflect how often it occurs.
The procedure is slightly different for defensive rebounds. The function determines which team got the rebound, and the other team's possession count is incremented. This is because this project counts a possession when it ends rather than when it starts. A defensive rebound means the other team missed a shot and ended its possession, so the opposing team's possession count is the one incremented.
def handle_rebound(play, home_team_id, away_team_id, home_team_poss_res, away_team_poss_res):
    if "Defensive" in play.text:
        if play.team_id == home_team_id:
            away_team_poss_res['poss'] += 1
        elif play.team_id == away_team_id:
            home_team_poss_res['poss'] += 1
    elif "Offensive" in play.text:
        if play.team_id == home_team_id:
            home_team_poss_res['oreb'] += 1
            home_team_poss_res['poss'] += 1
        elif play.team_id == away_team_id:
            away_team_poss_res['oreb'] += 1
            away_team_poss_res['poss'] += 1
    return [home_team_poss_res, away_team_poss_res]
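To see how the end-of-possession counting plays out, it can help to trace a short sequence by hand. The following is a simplified sketch, not the notebook's own code: it replaces the play-by-play rows with hypothetical (team, event) pairs and applies the same counting rules described above for shots, rebounds, and turnovers.

```python
from collections import Counter

HOME, AWAY = "home", "away"


def other(team):
    return AWAY if team == HOME else HOME


def count_possessions(events):
    """Apply the end-of-possession counting rules to (team, event) pairs."""
    stats = {HOME: Counter(), AWAY: Counter()}
    for team, event in events:
        if event == "made_two":
            stats[team]["two_fgm"] += 1
            stats[team]["poss"] += 1
        elif event == "miss":
            # a miss alone does not end the possession; the rebound decides
            stats[team]["fg_miss"] += 1
        elif event == "oreb":
            stats[team]["oreb"] += 1
            stats[team]["poss"] += 1
        elif event == "dreb":
            # the rebounding team's opponent just ended its possession
            stats[other(team)]["poss"] += 1
        elif event == "turnover":
            stats[team]["tov"] += 1
            stats[team]["poss"] += 1
    return stats


# made-up sequence: home misses, rebounds its own miss, scores;
# away turns it over, then misses and home gets the defensive rebound
events = [
    (HOME, "miss"), (HOME, "oreb"), (HOME, "made_two"),
    (AWAY, "turnover"),
    (AWAY, "miss"), (HOME, "dreb"),
]
stats = count_possessions(events)
print(dict(stats[HOME]))
print(dict(stats[AWAY]))
```

In this trace the home team is credited with two possessions (the offensive rebound and the made two), and the away team with two (the turnover and the miss that ended in a defensive rebound).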
Handle turnover¶
The handle turnover function is simple to follow. It determines which team committed the turnover. Because a turnover ends that team's possession, the team's turnover count and possession count are both incremented.
def handle_turnover(play, home_team_id, away_team_id, home_team_poss_res, away_team_poss_res):
    if play.team_id == home_team_id:
        home_team_poss_res['tov'] += 1
        home_team_poss_res['poss'] += 1
    elif play.team_id == away_team_id:
        away_team_poss_res['tov'] += 1
        away_team_poss_res['poss'] += 1
    return [home_team_poss_res, away_team_poss_res]
Handle foul¶
The handle foul function is important for tracking possessions that did not end in a shot. It checks which team committed the foul and updates the stats of the opposing team (the team that was fouled). The team that was fouled gets its possession count incremented, as well as the count of how many times it was fouled.
def handle_foul(play, home_team_id, away_team_id, home_team_poss_res, away_team_poss_res):
    if play.team_id == home_team_id:
        away_team_poss_res['got_fouled'] += 1
        away_team_poss_res['poss'] += 1
    elif play.team_id == away_team_id:
        home_team_poss_res['got_fouled'] += 1
        home_team_poss_res['poss'] += 1
    return [home_team_poss_res, away_team_poss_res]
Handle free throw¶
The handle free throw function does not affect possession counts; it only keeps track of a team's free throw percentage for the sake of the simulation. The team is determined, and either its free throws made or its free throws missed count is incremented accordingly.
def handle_free_throw(play, home_team_id, away_team_id, home_team_poss_res, away_team_poss_res):
    if play.scoring_play:
        if play.team_id == home_team_id:
            home_team_poss_res['ft'] += 1
        elif play.team_id == away_team_id:
            away_team_poss_res['ft'] += 1
    elif not play.scoring_play:
        if play.team_id == home_team_id:
            home_team_poss_res['ft_miss'] += 1
        elif play.team_id == away_team_id:
            away_team_poss_res['ft_miss'] += 1
    return [home_team_poss_res, away_team_poss_res]
Update stat_list¶
The update stat list function is used to keep track of teams' stats across many games. The stat_list structure is a dictionary of dictionaries. The outer dictionary has team names as keys and their statistics as values. The inner dictionaries have statistic names as keys and the actual counts as values.
This function works by first determining whether or not a team is already in the dictionary. If it is not, it enters their stats from their first game. Otherwise, it uses more current game data to add to the team's total stat count across the whole season.
def update_stat_list(team_stat_list, home_team_name, away_team_name, home_team_poss_res, away_team_poss_res):
    # update stats for home team
    if home_team_name in team_stat_list:
        for key, value in home_team_poss_res.items():
            # skip the team_name entry; only numeric counts are summed
            if not isinstance(value, str):
                team_stat_list[home_team_name][key] += value
    else:
        team_stat_list[home_team_name] = home_team_poss_res
    # update stats for away team
    if away_team_name in team_stat_list:
        for key, value in away_team_poss_res.items():
            if not isinstance(value, str):
                team_stat_list[away_team_name][key] += value
    else:
        team_stat_list[away_team_name] = away_team_poss_res
    return team_stat_list
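The dictionary-of-dictionaries accumulation can be tried in isolation. The sketch below uses a hypothetical one-team helper named accumulate rather than the two-team function above, and it copies the first game's dictionary before storing it, which is a defensive choice of this sketch and not something the original function does. The second game's numbers are made up.

```python
def accumulate(totals, team_name, game_stats):
    """Add one game's counts to a team's running totals (illustrative helper)."""
    if team_name not in totals:
        # first game seen for this team: start from a copy of its stats
        totals[team_name] = dict(game_stats)
        return totals
    for key, value in game_stats.items():
        # the team_name entry is a string and is left untouched
        if not isinstance(value, str):
            totals[team_name][key] += value
    return totals


season = {}
accumulate(season, "Tulsa", {"team_name": "Tulsa", "two_fgm": 16, "poss": 100})
accumulate(season, "Tulsa", {"team_name": "Tulsa", "two_fgm": 20, "poss": 95})
print(season["Tulsa"])
```

After the two calls, the inner dictionary holds the summed counts while the team name stays as it was.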
Update opp list¶
The update opp list function effectively tracks a team's defense. The first thing that is done is a renaming. By taking one team's name and adding "_defense" to it, and then attaching that to the other team's stats, the result is the first team's defensive statistics.
Apart from switching the team names, this function operates exactly like the update stat list function. It checks whether a team already exists in the list and, if so, updates its stats accordingly. If the team does not exist yet, its stat values are initialized.
def update_opp_list(team_opp_list, home_team_name, away_team_name, home_team_poss_res, away_team_poss_res):
    # Make copies of the input dictionaries
    home_stats = home_team_poss_res.copy()
    away_stats = away_team_poss_res.copy()
    # Modify the copies: each team's defense is described by its opponent's stats
    away_stats['team_name'] = home_team_name + "_defense"
    home_stats['team_name'] = away_team_name + "_defense"
    # Update stats for home team
    if home_team_name in team_opp_list:
        for key, value in away_stats.items():
            if isinstance(value, (int, float)):
                team_opp_list[home_team_name][key] += value
    else:
        team_opp_list[home_team_name] = away_stats
    # Update stats for away team
    if away_team_name in team_opp_list:
        for key, value in home_stats.items():
            if isinstance(value, (int, float)):
                team_opp_list[away_team_name][key] += value
    else:
        team_opp_list[away_team_name] = home_stats
    return team_opp_list
Core data collection loop¶
The loop that collects all of the data is rather simple because so much of the work is done by helper functions. The play-by-play data is read into a pandas dataframe from a CSV file. As the file consists of many games listed one after another, the pandas groupby() function is used to look at each individual game by game_id. For each game, the home team and away team are determined. The teams, along with dictionaries ready to hold their stats, are passed to the handle_game function. The handle_game function returns those updated dictionaries, and they are passed to the update_stat_list and update_opp_list functions. After reading through the whole game, the game date and each team's possession statistics are recorded in the result_list array. After doing this for the whole dataset, the team_stat_list and team_opp_list dictionaries contain every team's season statistics.
# read in the play-by-play data
games = pd.read_csv("2024_play_by_play.csv")
# group the play-by-play data to be able to look at individual games
grouped = games.groupby('game_id')
# two dictionaries used to keep track of offensive and defensive possession statistics
team_stat_list = {}
team_opp_list = {}
# used to keep track of when games occurred and the team statistics from that game
result_list = []
for game_id, group_data in grouped:
    home_team_id = group_data.iloc[0]["home_team_id"]
    away_team_id = group_data.iloc[0]["away_team_id"]
    home_team_name = group_data.iloc[0]["home_team_name"]
    away_team_name = group_data.iloc[0]["away_team_name"]
    game_date = group_data.iloc[0]["game_date"]
    # dictionaries to keep track of team possession statistics
    home_team_poss_res = {"team_name": home_team_name, "two_fgm": 0, "thr_fgm": 0, "fg_miss": 0, "ft": 0, "ft_miss": 0, "tov": 0, "oreb": 0, "got_fouled": 0, "poss": 0}
    away_team_poss_res = {"team_name": away_team_name, "two_fgm": 0, "thr_fgm": 0, "fg_miss": 0, "ft": 0, "ft_miss": 0, "tov": 0, "oreb": 0, "got_fouled": 0, "poss": 0}
    # handle_game returns a list containing the possession statistics for each team
    game_stats = handle_game(group_data, home_team_id, away_team_id, home_team_poss_res, away_team_poss_res)
    # take note of the game date, and add a copy of each team's statistics from the game_stats list
    result_list.append([game_date, game_stats[0].copy(), game_stats[1].copy()])
    team_stat_list = update_stat_list(team_stat_list, home_team_name, away_team_name, game_stats[0], game_stats[1])
    team_opp_list = update_opp_list(team_opp_list, home_team_name, away_team_name, game_stats[0], game_stats[1])
Examples of season level stats¶
# turn offensive and defensive stats into pandas DataFrames
team_off_stats = pd.DataFrame.from_dict(team_stat_list, orient='index')
team_def_stats = pd.DataFrame.from_dict(team_opp_list, orient='index')
team_off_stats.head()
| | team_name | two_fgm | thr_fgm | fg_miss | ft | ft_miss | tov | oreb | got_fouled | poss |
|---|---|---|---|---|---|---|---|---|---|---|
| Eastern Washington | Eastern Washington | 620 | 274 | 889 | 525 | 156 | 422 | 327 | 628 | 2982 |
| Montana | Montana | 725 | 278 | 1068 | 477 | 125 | 371 | 330 | 582 | 3135 |
| Idaho | Idaho | 535 | 243 | 991 | 357 | 130 | 366 | 273 | 487 | 2705 |
| Montana State | Montana State | 608 | 300 | 1041 | 414 | 159 | 391 | 290 | 589 | 3068 |
| Idaho State | Idaho State | 623 | 218 | 1027 | 429 | 196 | 372 | 376 | 607 | 2970 |
team_def_stats.head()
| | team_name | two_fgm | thr_fgm | fg_miss | ft | ft_miss | tov | oreb | got_fouled | poss |
|---|---|---|---|---|---|---|---|---|---|---|
| Eastern Washington | Eastern Washington_defense | 549 | 282 | 1076 | 454 | 158 | 385 | 396 | 587 | 3029 |
| Montana | Montana_defense | 697 | 225 | 1159 | 509 | 197 | 349 | 397 | 606 | 3210 |
| Idaho | Idaho_defense | 546 | 236 | 945 | 464 | 172 | 356 | 316 | 571 | 2780 |
| Montana State | Montana State_defense | 678 | 222 | 1035 | 495 | 199 | 457 | 392 | 637 | 3201 |
| Idaho State | Idaho State_defense | 665 | 200 | 954 | 422 | 144 | 415 | 294 | 549 | 2882 |
Data Processing¶
The goal of the data processing step is to take all of the data collected above and format it so that a multinomial logistic regression can be performed. This means taking aggregate possession data from previous games, turning it into percentages, and coming up with a categorical results column.
Game by game data¶
This code makes a dataframe to hold data (from each team's perspective) from every individual game. The data is read from the result_list array. The game date is noted, as well as each team's possession statistics from said game. One row is made in the dataframe from the perspective of the home team, and another row is made from the perspective of the away team. The end result is a dataframe with two entries for every game played.
# sort the list of results by game_date
result_list = sorted(result_list, key=lambda x: x[0])
stats_on_date = pd.DataFrame({})
for row in result_list:
    date = row[0]
    team1 = row[1]
    team2 = row[2]
    game = {"date": date,
            "team_name": team1['team_name'],
            "team_twos": team1['two_fgm'],
            "team_threes": team1['thr_fgm'],
            "team_miss": team1['fg_miss'],
            "team_ft": team1['ft'],
            "team_ft_miss": team1['ft_miss'],
            "team_tov": team1['tov'],
            "team_oreb": team1['oreb'],
            "team_fouled": team1['got_fouled'],
            "team_poss": team1['poss'],
            "opp_name": team2['team_name'],
            "opp_twos": team2['two_fgm'],
            "opp_threes": team2['thr_fgm'],
            "opp_miss": team2['fg_miss'],
            "opp_ft": team2['ft'],
            "opp_ft_miss": team2['ft_miss'],
            "opp_tov": team2['tov'],
            "opp_oreb": team2['oreb'],
            "opp_fouled": team2['got_fouled'],
            "opp_poss": team2['poss']
            }
    opp_game = {"date": date,
                "team_name": team2['team_name'],
                "team_twos": team2['two_fgm'],
                "team_threes": team2['thr_fgm'],
                "team_miss": team2['fg_miss'],
                "team_ft": team2['ft'],
                "team_ft_miss": team2['ft_miss'],
                "team_tov": team2['tov'],
                "team_oreb": team2['oreb'],
                "team_fouled": team2['got_fouled'],
                "team_poss": team2['poss'],
                "opp_name": team1['team_name'],
                "opp_twos": team1['two_fgm'],
                "opp_threes": team1['thr_fgm'],
                "opp_miss": team1['fg_miss'],
                "opp_ft": team1['ft'],
                "opp_ft_miss": team1['ft_miss'],
                "opp_tov": team1['tov'],
                "opp_oreb": team1['oreb'],
                "opp_fouled": team1['got_fouled'],
                "opp_poss": team1['poss']
                }
    # add values from home team's perspective
    new_row = pd.DataFrame.from_dict([game])
    stats_on_date = pd.concat([stats_on_date, new_row])
    # add values from away team's perspective
    new_row = pd.DataFrame.from_dict([opp_game])
    stats_on_date = pd.concat([stats_on_date, new_row])
stats_on_date = stats_on_date.reset_index(drop=True)
stats_on_date.head(10)
| | date | team_name | team_twos | team_threes | team_miss | team_ft | team_ft_miss | team_tov | team_oreb | team_fouled | team_poss | opp_name | opp_twos | opp_threes | opp_miss | opp_ft | opp_ft_miss | opp_tov | opp_oreb | opp_fouled | opp_poss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-11-06 | Tulsa | 16 | 8 | 36 | 14 | 6 | 18 | 20 | 18 | 100 | Central Arkansas | 14 | 6 | 40 | 7 | 5 | 12 | 11 | 15 | 89 |
| 1 | 2023-11-06 | Central Arkansas | 14 | 6 | 40 | 7 | 5 | 12 | 11 | 15 | 89 | Tulsa | 16 | 8 | 36 | 14 | 6 | 18 | 20 | 18 | 100 |
| 2 | 2023-11-06 | Northwestern | 20 | 5 | 34 | 17 | 4 | 12 | 16 | 19 | 94 | Binghamton | 15 | 7 | 31 | 10 | 2 | 19 | 10 | 16 | 90 |
| 3 | 2023-11-06 | Binghamton | 15 | 7 | 31 | 10 | 2 | 19 | 10 | 16 | 90 | Northwestern | 20 | 5 | 34 | 17 | 4 | 12 | 16 | 19 | 94 |
| 4 | 2023-11-06 | Syracuse | 23 | 5 | 39 | 22 | 5 | 11 | 17 | 22 | 105 | New Hampshire | 17 | 8 | 43 | 14 | 5 | 16 | 15 | 17 | 106 |
| 5 | 2023-11-06 | New Hampshire | 17 | 8 | 43 | 14 | 5 | 16 | 15 | 17 | 106 | Syracuse | 23 | 5 | 39 | 22 | 5 | 11 | 17 | 22 | 105 |
| 6 | 2023-11-06 | Minnesota | 19 | 5 | 22 | 27 | 8 | 17 | 13 | 29 | 100 | Bethune-Cookman | 19 | 4 | 47 | 10 | 4 | 14 | 23 | 16 | 104 |
| 7 | 2023-11-06 | Bethune-Cookman | 19 | 4 | 47 | 10 | 4 | 14 | 23 | 16 | 104 | Minnesota | 19 | 5 | 22 | 27 | 8 | 17 | 13 | 29 | 100 |
| 8 | 2023-11-06 | Nebraska | 16 | 11 | 29 | 19 | 10 | 9 | 8 | 21 | 92 | Lindenwood | 19 | 3 | 46 | 5 | 4 | 12 | 13 | 12 | 95 |
| 9 | 2023-11-06 | Lindenwood | 19 | 3 | 46 | 5 | 4 | 12 | 13 | 12 | 95 | Nebraska | 16 | 11 | 29 | 19 | 10 | 9 | 8 | 21 | 92 |
Mid season aggregate data¶
The following code computes the aggregate possession data. February 1st is chosen as the point at which game-by-game data starts being recorded. For every game on or after Feb 1, the code sums the team's stats from all of its earlier games, sums the stats that earlier opponents recorded against the opposing team (its defensive record), and keeps the stats from the current game unchanged alongside those sums.
# Convert date column to datetime type
stats_on_date['date'] = pd.to_datetime(stats_on_date['date'])
start_date = '2024-02-01'
# Filter dataframe to include only games after or on the start date
filtered_df = stats_on_date[stats_on_date['date'] >= start_date]
# Initialize list to store rows for the new dataframe
new_rows = []
team_stats_columns = ["team_twos", "team_threes", "team_miss", "team_tov", "team_oreb", "team_fouled", "team_poss"]
opp_stats_columns = ["opp_twos", "opp_threes", "opp_miss", "opp_tov", "opp_oreb", "opp_fouled", "opp_poss"]
# Iterate through each game in the filtered dataframe
for index, game in filtered_df.iterrows():
    # Extract team name and opponent name for the current game
    team_name = game['team_name']
    opp_name = game['opp_name']
    # Sum the team's own stats from all rows before the current game
    team_prev_sum = stats_on_date[stats_on_date['team_name'] == team_name].loc[:index-1][team_stats_columns].sum()
    # Sum what earlier opponents did against opp_name, i.e. the opponent's defensive record
    opp_prev_sum = stats_on_date[stats_on_date['opp_name'] == opp_name].loc[:index-1][team_stats_columns].sum()
    # Extract team stats for the current game
    team_game_stat = game[team_stats_columns].tolist()
    # Combine all data into a single row
    new_row = [game['date'], team_name] + team_prev_sum.tolist() + [opp_name] + opp_prev_sum.tolist() + team_game_stat
    # Append row to the list
    new_rows.append(new_row)
new_columns = ["date",
               "team_name", "prev_team_twos", "prev_team_threes", "prev_team_miss", "prev_team_tov", "prev_team_oreb", "prev_team_fouled", "prev_team_poss",
               "opp_name", "prev_opp_twos", "prev_opp_threes", "prev_opp_miss", "prev_opp_tov", "prev_opp_oreb", "prev_opp_fouled", "prev_opp_poss"] + \
              team_stats_columns
mid_season_data = pd.DataFrame(new_rows, columns=new_columns)
mid_season_data.head(6)
| | date | team_name | prev_team_twos | prev_team_threes | prev_team_miss | prev_team_tov | prev_team_oreb | prev_team_fouled | prev_team_poss | opp_name | prev_opp_twos | prev_opp_threes | prev_opp_miss | prev_opp_tov | prev_opp_oreb | prev_opp_fouled | prev_opp_poss | team_twos | team_threes | team_miss | team_tov | team_oreb | team_fouled | team_poss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024-02-01 | Montana State | 361.0 | 163.0 | 603.0 | 234.0 | 164.0 | 355.0 | 1803.0 | Eastern Washington | 345.0 | 156.0 | 675.0 | 253.0 | 252.0 | 374.0 | 1900.0 | 15 | 9 | 41 | 12 | 19 | 20 | 105 |
| 1 | 2024-02-01 | Eastern Washington | 370.0 | 188.0 | 570.0 | 265.0 | 212.0 | 372.0 | 1858.0 | Montana State | 394.0 | 128.0 | 594.0 | 266.0 | 231.0 | 392.0 | 1889.0 | 20 | 2 | 27 | 18 | 9 | 18 | 92 |
| 2 | 2024-02-01 | Montana | 439.0 | 157.0 | 643.0 | 215.0 | 203.0 | 327.0 | 1850.0 | Idaho | 335.0 | 158.0 | 609.0 | 227.0 | 196.0 | 345.0 | 1751.0 | 16 | 8 | 25 | 11 | 10 | 21 | 87 |
| 3 | 2024-02-01 | Idaho | 333.0 | 158.0 | 647.0 | 218.0 | 182.0 | 313.0 | 1729.0 | Montana | 404.0 | 132.0 | 678.0 | 209.0 | 230.0 | 351.0 | 1885.0 | 19 | 9 | 27 | 10 | 6 | 8 | 74 |
| 4 | 2024-02-01 | Northern Colorado | 421.0 | 169.0 | 634.0 | 220.0 | 205.0 | 336.0 | 1868.0 | Idaho State | 417.0 | 114.0 | 588.0 | 255.0 | 198.0 | 341.0 | 1793.0 | 19 | 9 | 30 | 10 | 5 | 22 | 94 |
| 5 | 2024-02-01 | Idaho State | 381.0 | 134.0 | 630.0 | 244.0 | 233.0 | 353.0 | 1828.0 | Northern Colorado | 370.0 | 184.0 | 658.0 | 244.0 | 192.0 | 323.0 | 1851.0 | 17 | 10 | 44 | 11 | 19 | 27 | 115 |
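The "sum of everything before this game" idea can also be expressed without an explicit loop. The following is a minimal sketch on made-up numbers, not a drop-in replacement for the cell above: it assumes the rows are already sorted by date and uses a per-team cumulative sum, then subtracts the current game so that each row only sees earlier games.

```python
import pandas as pd

# toy data: five game rows for two teams, already in date order
games = pd.DataFrame({
    "team_name": ["A", "B", "A", "B", "A"],
    "team_poss": [100, 90, 95, 105, 80],
})

# running total per team, including the current game
running = games.groupby("team_name")["team_poss"].cumsum()

# subtract the current game to get the total from earlier games only
games["prev_team_poss"] = running - games["team_poss"]
print(games)
```

Each team's first game gets a previous total of 0, and later games accumulate only what came before them, which is the same quantity the loop computes with .loc[:index-1].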
Possession results as percentages¶
This code takes the aggregate possession data and converts each possession result count into its share of the team's total possessions (a fraction between 0 and 1).
poss_percent = mid_season_data.copy()
for index, row in poss_percent.iterrows():
    if row['prev_team_poss'] > 0:
        prev_team_poss = row['prev_team_poss']
        poss_percent.at[index, 'prev_team_twos'] = row['prev_team_twos'] / prev_team_poss
        poss_percent.at[index, 'prev_team_threes'] = row['prev_team_threes'] / prev_team_poss
        poss_percent.at[index, 'prev_team_miss'] = row['prev_team_miss'] / prev_team_poss
        poss_percent.at[index, 'prev_team_tov'] = row['prev_team_tov'] / prev_team_poss
        poss_percent.at[index, 'prev_team_oreb'] = row['prev_team_oreb'] / prev_team_poss
        poss_percent.at[index, 'prev_team_fouled'] = row['prev_team_fouled'] / prev_team_poss
    if row['prev_opp_poss'] > 0:
        prev_opp_poss = row['prev_opp_poss']
        poss_percent.at[index, 'prev_opp_twos'] = row['prev_opp_twos'] / prev_opp_poss
        poss_percent.at[index, 'prev_opp_threes'] = row['prev_opp_threes'] / prev_opp_poss
        poss_percent.at[index, 'prev_opp_miss'] = row['prev_opp_miss'] / prev_opp_poss
        poss_percent.at[index, 'prev_opp_tov'] = row['prev_opp_tov'] / prev_opp_poss
        poss_percent.at[index, 'prev_opp_oreb'] = row['prev_opp_oreb'] / prev_opp_poss
        poss_percent.at[index, 'prev_opp_fouled'] = row['prev_opp_fouled'] / prev_opp_poss
poss_percent.head(6)
| | date | team_name | prev_team_twos | prev_team_threes | prev_team_miss | prev_team_tov | prev_team_oreb | prev_team_fouled | prev_team_poss | opp_name | prev_opp_twos | prev_opp_threes | prev_opp_miss | prev_opp_tov | prev_opp_oreb | prev_opp_fouled | prev_opp_poss | team_twos | team_threes | team_miss | team_tov | team_oreb | team_fouled | team_poss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024-02-01 | Montana State | 0.200222 | 0.090405 | 0.334443 | 0.129784 | 0.090960 | 0.196894 | 1803.0 | Eastern Washington | 0.181579 | 0.082105 | 0.355263 | 0.133158 | 0.132632 | 0.196842 | 1900.0 | 15 | 9 | 41 | 12 | 19 | 20 | 105 |
| 1 | 2024-02-01 | Eastern Washington | 0.199139 | 0.101184 | 0.306781 | 0.142626 | 0.114101 | 0.200215 | 1858.0 | Montana State | 0.208576 | 0.067761 | 0.314452 | 0.140815 | 0.122287 | 0.207517 | 1889.0 | 20 | 2 | 27 | 18 | 9 | 18 | 92 |
| 2 | 2024-02-01 | Montana | 0.237297 | 0.084865 | 0.347568 | 0.116216 | 0.109730 | 0.176757 | 1850.0 | Idaho | 0.191319 | 0.090234 | 0.347801 | 0.129640 | 0.111936 | 0.197030 | 1751.0 | 16 | 8 | 25 | 11 | 10 | 21 | 87 |
| 3 | 2024-02-01 | Idaho | 0.192597 | 0.091382 | 0.374205 | 0.126084 | 0.105263 | 0.181029 | 1729.0 | Montana | 0.214324 | 0.070027 | 0.359682 | 0.110875 | 0.122016 | 0.186207 | 1885.0 | 19 | 9 | 27 | 10 | 6 | 8 | 74 |
| 4 | 2024-02-01 | Northern Colorado | 0.225375 | 0.090471 | 0.339400 | 0.117773 | 0.109743 | 0.179872 | 1868.0 | Idaho State | 0.232571 | 0.063581 | 0.327942 | 0.142220 | 0.110429 | 0.190184 | 1793.0 | 19 | 9 | 30 | 10 | 5 | 22 | 94 |
| 5 | 2024-02-01 | Idaho State | 0.208425 | 0.073304 | 0.344639 | 0.133479 | 0.127462 | 0.193107 | 1828.0 | Northern Colorado | 0.199892 | 0.099406 | 0.355484 | 0.131821 | 0.103728 | 0.174500 | 1851.0 | 17 | 10 | 44 | 11 | 19 | 27 | 115 |
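The same count-to-share conversion can be written as a single column-wise division. This is a sketch of an alternative, not the code used above; it keeps the same guard against teams with zero previous possessions. The first row uses Montana State's counts from the table, and the second row is a made-up team with no earlier games.

```python
import pandas as pd

totals = pd.DataFrame({
    "prev_team_twos": [361.0, 0.0],
    "prev_team_threes": [163.0, 0.0],
    "prev_team_poss": [1803.0, 0.0],
})
count_cols = ["prev_team_twos", "prev_team_threes"]

# only divide rows that actually have previous possessions
has_poss = totals["prev_team_poss"] > 0

shares = totals.copy()
shares.loc[has_poss, count_cols] = totals.loc[has_poss, count_cols].div(
    totals.loc[has_poss, "prev_team_poss"], axis=0
)
print(shares)
```

The first row reproduces the 0.200222 and 0.090405 shares shown for Montana State, while the row with zero possessions is left unchanged.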
Categorical Results¶
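This cell makes the categorical result column. Every game is read through, and each possession result from that game gets its own row: for each result type, a loop adds as many rows as there were occurrences of that result. The result codes are 0 for a missed shot, 1 for an offensive rebound, 2 for a made two, 3 for a made three, 4 for a turnover, and 5 for being fouled.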
row_data = []
for index, row in poss_percent.iterrows():
    twos = row['team_twos']
    threes = row['team_threes']
    misses = row['team_miss']
    tov = row['team_tov']
    oreb = row['team_oreb']
    fouls = row['team_fouled']
    poss = row['team_poss']
    data = row.values.tolist()
    # the first 17 values are the date, names, and previous-game rates
    prev_data = data[0:17]
    for i in range(misses):
        res = [0]
        new_row = prev_data + res
        row_data.append(new_row)
    for i in range(oreb):
        res = [1]
        new_row = prev_data + res
        row_data.append(new_row)
    for i in range(twos):
        res = [2]
        new_row = prev_data + res
        row_data.append(new_row)
    for i in range(threes):
        res = [3]
        new_row = prev_data + res
        row_data.append(new_row)
    for i in range(tov):
        res = [4]
        new_row = prev_data + res
        row_data.append(new_row)
    for i in range(fouls):
        res = [5]
        new_row = prev_data + res
        row_data.append(new_row)
columns = ['date', 'team_name', 'prev_twos', 'prev_threes', 'prev_miss', 'prev_tov', 'prev_oreb', 'prev_fouled', 'prev_poss',
           'opp_name', 'opp_twos', 'opp_threes', 'opp_miss', 'opp_tov', 'opp_oreb', 'opp_fouled', 'opp_poss',
           'result']
game_results = pd.DataFrame(row_data, columns=columns)
game_results.tail(6)
| | date | team_name | prev_twos | prev_threes | prev_miss | prev_tov | prev_oreb | prev_fouled | prev_poss | opp_name | opp_twos | opp_threes | opp_miss | opp_tov | opp_oreb | opp_fouled | opp_poss | result |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 425629 | 2024-04-08 | Purdue | 0.210003 | 0.087593 | 0.31307 | 0.119923 | 0.149489 | 0.216358 | 3619.0 | UConn | 0.185498 | 0.067851 | 0.394292 | 0.120268 | 0.128422 | 0.186372 | 3434.0 | 5 |
| 425630 | 2024-04-08 | Purdue | 0.210003 | 0.087593 | 0.31307 | 0.119923 | 0.149489 | 0.216358 | 3619.0 | UConn | 0.185498 | 0.067851 | 0.394292 | 0.120268 | 0.128422 | 0.186372 | 3434.0 | 5 |
| 425631 | 2024-04-08 | Purdue | 0.210003 | 0.087593 | 0.31307 | 0.119923 | 0.149489 | 0.216358 | 3619.0 | UConn | 0.185498 | 0.067851 | 0.394292 | 0.120268 | 0.128422 | 0.186372 | 3434.0 | 5 |
| 425632 | 2024-04-08 | Purdue | 0.210003 | 0.087593 | 0.31307 | 0.119923 | 0.149489 | 0.216358 | 3619.0 | UConn | 0.185498 | 0.067851 | 0.394292 | 0.120268 | 0.128422 | 0.186372 | 3434.0 | 5 |
| 425633 | 2024-04-08 | Purdue | 0.210003 | 0.087593 | 0.31307 | 0.119923 | 0.149489 | 0.216358 | 3619.0 | UConn | 0.185498 | 0.067851 | 0.394292 | 0.120268 | 0.128422 | 0.186372 | 3434.0 | 5 |
| 425634 | 2024-04-08 | Purdue | 0.210003 | 0.087593 | 0.31307 | 0.119923 | 0.149489 | 0.216358 | 3619.0 | UConn | 0.185498 | 0.067851 | 0.394292 | 0.120268 | 0.128422 | 0.186372 | 3434.0 | 5 |
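Expanding counts into one row per possession can also be done with a reshape followed by a repeat. The sketch below works on a single made-up game and is only an illustration of the idea; the RESULT_CODES mapping mirrors the codes 0 through 5 used in the loops above.

```python
import pandas as pd

# one made-up game: counts of each possession result
game = pd.DataFrame({
    "team_name": ["A"],
    "team_miss": [2], "team_oreb": [1], "team_twos": [3],
    "team_threes": [0], "team_tov": [1], "team_fouled": [2],
})
RESULT_CODES = {"team_miss": 0, "team_oreb": 1, "team_twos": 2,
                "team_threes": 3, "team_tov": 4, "team_fouled": 5}

# one row per (game, stat) pair with its count
stacked = game.melt(id_vars="team_name", var_name="stat", value_name="count")

# repeat each row as many times as the result occurred
expanded = stacked.loc[stacked.index.repeat(stacked["count"])]
expanded = expanded.assign(result=expanded["stat"].map(RESULT_CODES))

possessions = expanded[["team_name", "result"]].reset_index(drop=True)
print(possessions)
```

A result with a count of zero, such as the made threes here, contributes no rows at all, which matches the behavior of range(0) in the loop version.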
Statistical Analysis¶
Multinomial Logistic Regression¶
A multinomial regression is a statistical method used to predict the outcome of a categorical dependent variable with more than two categories. This project uses input data of a team's offensive capabilities (measured by their percentage of possessions that end in missed shots, two-point shots, three-point shots, offensive rebounds, turnovers, and fouls), and another team's defensive capabilities (measured by the same results). By running a multinomial regression on a team's offense and team's defense, probabilities for the team's offensive possession ending in one of those results are calculated.
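As background, a multinomial logistic regression computes one linear score per result category and then converts the scores into probabilities with the softmax function, so the probabilities are positive and sum to one. The sketch below shows only that conversion step in plain Python. The two features, three classes, weights, and intercepts are all made-up values for illustration and are unrelated to the fitted model.

```python
import math


def class_scores(features, weights, intercepts):
    """One linear score per class: w_k . x + b_k."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, intercepts)
    ]


def softmax(scores):
    """Convert scores to probabilities; subtracting the max avoids overflow."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# hypothetical inputs: two features and three result classes
features = [0.21, 0.19]
weights = [[1.0, 2.0], [0.5, -1.0], [-0.3, 0.4]]
intercepts = [0.1, 0.0, -0.2]

probs = softmax(class_scores(features, weights, intercepts))
print(probs, sum(probs))
```

In the project itself the model has twelve input features and six result classes, and the weights are learned from the possession rows rather than chosen by hand.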
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
# define the multinomial logistic regression model
# (multi_class is deprecated in newer scikit-learn releases, where multinomial is already the default for lbfgs)
model = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)
# feature columns: the offense's previous result rates followed by the opposing defense's
test_columns = ['prev_twos', 'prev_threes', 'prev_miss', 'prev_tov', 'prev_oreb', 'prev_fouled', 'opp_twos', 'opp_threes', 'opp_miss', 'opp_tov', 'opp_oreb', 'opp_fouled']
X = game_results[test_columns]
y = game_results['result']
# fit the model on the whole dataset
model.fit(X, y)
# use the feature values from the Purdue vs. UConn rows at the end of game_results as a test input
row = [0.210003, 0.087593, 0.31307, 0.119923, 0.149489, 0.216358, 0.185498, 0.067851, 0.394292, 0.120268, 0.128422, 0.186372]
row_df = pd.DataFrame([row], columns=test_columns)
# predict a multinomial probability distribution
results = model.predict_proba(row_df)
# summarize the predicted probabilities
print(f"Predicted Probabilities of Purdue when facing UConn:\n{results[0]}")
Predicted Probabilities of Purdue when facing UConn:
[0.32276465 0.12013916 0.18664887 0.07507893 0.10356496 0.19180343]
Simulating Games¶
The game simulation is done by simulating alternating possessions between two teams. Each team's possession result probabilities are obtained from the model built with multinomial regression. Probabilities from both teams' offense and defense are used to estimate the likelihood of each team scoring on a given possession. After simulating a predefined number of possessions, the scores are reported.
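The core of such a simulation is drawing one possession result at random according to a team's probabilities. The following is a deliberately simplified sketch under stated assumptions, not the project's simulator: it strictly alternates possessions, awards points only for made twos and threes, and ignores free throws and the extra chance created by an offensive rebound. The function name, the RESULTS labels, and the example probabilities (rounded from the Purdue output above) are choices made for this illustration.

```python
import random

# result order assumed to match the codes 0-5 used for the regression target
RESULTS = ["miss", "oreb", "two", "three", "tov", "fouled"]
POINTS = {"two": 2, "three": 3}


def simulate_game(team1_probs, team2_probs, possessions_per_team, rng):
    """Alternate possessions and add up points from made field goals only."""
    scores = [0, 0]
    for _ in range(possessions_per_team):
        for side, probs in enumerate((team1_probs, team2_probs)):
            result = rng.choices(RESULTS, weights=probs, k=1)[0]
            scores[side] += POINTS.get(result, 0)
    return scores


# rounded example probabilities, used for both teams in this toy run
example_probs = [0.32, 0.12, 0.19, 0.08, 0.10, 0.19]
rng = random.Random(2024)  # seeded so the run is repeatable
print(simulate_game(example_probs, example_probs, 70, rng))
```

Running the function many times with different seeds and averaging the scores gives the kind of summary described above.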
Get team possession probabilities¶
This function is used to determine each team's possession result probabilities. The first thing that is done is obtaining the most recent aggregate probabilities for each team's offense and defense. These are combined so that one team's offensive and the other team's defensive probabilities are grouped together. These are then passed to the model built on the multinomial regression, and the calculated values are returned.
# get team probabilities
def get_probs(team1, team2):
    # most recent aggregate offensive probabilities for each team
    team1_off = game_results[game_results['team_name'] == team1][['prev_twos', 'prev_threes', 'prev_miss', 'prev_tov', 'prev_oreb', 'prev_fouled']].iloc[-1].tolist()
    team2_off = game_results[game_results['team_name'] == team2][['prev_twos', 'prev_threes', 'prev_miss', 'prev_tov', 'prev_oreb', 'prev_fouled']].iloc[-1].tolist()
    # most recent aggregate defensive probabilities for each team
    team1_def = game_results[game_results['opp_name'] == team1][['opp_twos', 'opp_threes', 'opp_miss', 'opp_tov', 'opp_oreb', 'opp_fouled']].iloc[-1].tolist()
    team2_def = game_results[game_results['opp_name'] == team2][['opp_twos', 'opp_threes', 'opp_miss', 'opp_tov', 'opp_oreb', 'opp_fouled']].iloc[-1].tolist()
    # pair each team's offense with the other team's defense
    most_recent_team1 = team1_off + team2_def
    most_recent_team2 = team2_off + team1_def
    test_columns = ['prev_twos', 'prev_threes', 'prev_miss', 'prev_tov', 'prev_oreb', 'prev_fouled', 'opp_twos', 'opp_threes', 'opp_miss', 'opp_tov', 'opp_oreb', 'opp_fouled']
    team1_probs = model.predict_proba(pd.DataFrame([most_recent_team1], columns=test_columns)).tolist()[0]
    team2_probs = model.predict_proba(pd.DataFrame([most_recent_team2], columns=test_columns)).tolist()[0]
    return [team1_probs, team2_probs]
probs = get_probs("North Carolina", "Duke")
print(f"Example probabilities between North Carolina and Duke:\n{probs[0]}\n{probs[1]}")
Example probabilities between North Carolina and Duke:
[0.36136018257551633, 0.10115549297568453, 0.19492609017437149, 0.07580382814234912, 0.09460779242619718, 0.17214661370588133]
[0.3535532476244952, 0.09711739050523685, 0.2028816180158325, 0.08113898256442995, 0.0960341810888952, 0.1692745802011103]
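To make the two lists above easier to read, each probability can be paired with a possession result label. The sketch below assumes the column order miss, offensive rebound, two, three, turnover, foul, which is how sim_poss indexes the list later in this notebook. The label names and the helper label_probs are illustrative, not part of the original code; the authoritative order is whatever model.classes_ reports for the fitted model.

```python
# Labels are an assumption: they follow the index order used by sim_poss
# (0 = miss, 1 = offensive rebound, 2 = two, 3 = three, 4 = turnover, 5 = foul).
ASSUMED_LABELS = ["miss", "oreb", "two", "three", "tov", "foul"]

# North Carolina's probabilities, copied from the example output above
unc_probs = [0.36136018257551633, 0.10115549297568453, 0.19492609017437149,
             0.07580382814234912, 0.09460779242619718, 0.17214661370588133]

def label_probs(probs, labels=ASSUMED_LABELS):
    """Pair each probability with its (assumed) possession result label."""
    if len(probs) != len(labels):
        raise ValueError("expected one probability per label")
    return dict(zip(labels, probs))

labelled = label_probs(unc_probs)
for name, p in labelled.items():
    print(f"{name:>5}: {p:.3f}")

# a predicted distribution should sum to (approximately) one
print(f"total: {sum(unc_probs):.6f}")
```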
Get team misc stats¶
Since the dataframe containing the aggregate possession result probabilities does not include free throw percentage or possession count, these numbers are obtained in the get_misc_stats function. Each team's free throw makes and misses are read from the team_off_stats dataframe, which contains season-level statistics, and the free throw percentage is calculated from them.
Next, the average number of possessions per game for each team is determined from the stats_on_date dataframe. This dataframe contains game-level stats for every team, and the average possession count is found by calling the mean function on the possession column for each team and rounding the result.
def get_misc_stats(team1, team2):
    # grab free throw stats (makes and misses) for teams
    team1_fts = team_off_stats.loc[team1][['ft', 'ft_miss']].tolist()
    team2_fts = team_off_stats.loc[team2][['ft', 'ft_miss']].tolist()
    # calculate free throw percentage: makes / (makes + misses)
    team1_ft_p = team1_fts[0] / (team1_fts[0] + team1_fts[1])
    team2_ft_p = team2_fts[0] / (team2_fts[0] + team2_fts[1])
    # grab average number of possessions in a game
    team1_poss = round(stats_on_date.loc[stats_on_date['team_name'] == team1, 'team_poss'].mean())
    team2_poss = round(stats_on_date.loc[stats_on_date['team_name'] == team2, 'team_poss'].mean())
    team1_stats = [team1_poss, team1_ft_p]
    team2_stats = [team2_poss, team2_ft_p]
    return [team1_stats, team2_stats]
stats = get_misc_stats("North Carolina", "Duke")
print(f"Example of the misc stats collected for North Carolina and Duke:\n{stats[0]}\n{stats[1]}")
Example of the misc stats collected for North Carolina and Duke:
[94, 0.7589073634204275]
[87, 0.7227586206896551]
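The free throw percentage computed in get_misc_stats is makes divided by attempts. A minimal standalone sketch follows, using a hypothetical helper name, ft_percentage, and made-up counts. The guard for zero attempts is an addition that the original function does not have.

```python
def ft_percentage(made, missed):
    """Return free throws made as a fraction of attempts (made + missed)."""
    attempts = made + missed
    if attempts == 0:
        # assumption: treat a team with no attempts as 0.0 rather than raising
        return 0.0
    return made / attempts

# illustrative counts, not taken from the dataset
print(ft_percentage(300, 100))  # 0.75
```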
Simulate a possession¶
The sim_poss function is called during a game simulation to simulate one offensive possession and return its result. The offense's possession result probabilities, with the team's free throw percentage appended, are passed in as the argument. A result is chosen with the random.choices function, weighted by those probabilities, and the number of points that result produces is returned. A missed shot followed by an offensive rebound keeps the possession alive, so the draw is repeated.
import random

# return a normally distributed value centred on mean with standard deviation sigma
# (a sigma of 0 returns the mean unchanged)
def adj(mean, sigma):
    return random.normalvariate(mean, sigma)

def sim_poss(probs):
    # probs holds the six model probabilities followed by the free throw percentage:
    # [miss, offensive rebound, two, three, turnover, foul, free throw percentage]
    while True:
        # list of every option for a given possession
        options = ['fg_miss', 'two_pointer', 'three_pointer', 'turnover', 'foul']
        # list of probability for every possession option
        probabilities = [adj(probs[0], .0),    # miss
                         adj(probs[2], .00),   # two
                         adj(probs[3], .0),    # three
                         adj(probs[4], .00),   # tov
                         adj(probs[5], .00)]   # foul
        # randomly choose possession option
        result = random.choices(options, weights=probabilities, k=1)[0]
        # return how each possession option will affect the score
        if result == 'fg_miss':
            # if offensive rebound, keep current possession (loop again)
            x = random.random()
            if x < adj(probs[1], .008):
                pass
            else:
                return 0
        elif result == 'two_pointer':
            return 2
        elif result == 'three_pointer':
            return 3
        elif result == 'foul':
            ft_made = 0
            # simulate two free throw shots
            for i in range(2):
                x = random.random()
                if x < adj(probs[6], .015):
                    ft_made += 1
            return ft_made
        else:
            # turnover: the possession ends with no points
            return 0
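The weighted draw that sim_poss relies on can be checked in isolation. The sketch below uses made-up weights and a hypothetical helper, draw_counts, to show that random.choices normalises its weights internally, so the five weights do not need to sum to one once the offensive rebound probability has been left out of the list.

```python
import random
from collections import Counter

def draw_counts(options, weights, n, seed=0):
    """Draw n outcomes with random.choices and count how often each occurs."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    return Counter(rng.choices(options, weights=weights, k=n))

options = ['fg_miss', 'two_pointer', 'three_pointer', 'turnover', 'foul']
# illustrative weights that sum to 0.9, not 1.0
weights = [0.36, 0.19, 0.08, 0.10, 0.17]

n_draws = 100_000
counts = draw_counts(options, weights, n=n_draws)
total_weight = sum(weights)
for option, weight in zip(options, weights):
    expected = weight / total_weight      # share after normalisation
    observed = counts[option] / n_draws   # share seen in the sample
    print(f"{option:>13}: expected {expected:.3f}, observed {observed:.3f}")
```

The observed shares should track the normalised weights closely, which is the behaviour sim_poss depends on when it passes five of the six model probabilities as weights.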
Average function¶
A simple helper function used to calculate the average score across many simulated games.
# used to compute the average score of lots of simulated games
def Average(x):
    return sum(x) / len(x)
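Python's standard library offers an equivalent in statistics.fmean (available since Python 3.8). The short sketch below, with illustrative scores, compares the two and shows that both fail on an empty list, only with different exception types.

```python
import statistics

def Average(x):
    return sum(x) / len(x)

scores = [103, 107, 108]  # illustrative scores, not simulation output

# both approaches give the same value for a non-empty list
same = Average(scores) == statistics.fmean(scores)
print(f"Average and statistics.fmean agree: {same}")

# an empty list raises ZeroDivisionError in Average
# and statistics.StatisticsError in statistics.fmean
try:
    statistics.fmean([])
except statistics.StatisticsError as err:
    print(f"statistics.fmean failed: {err}")
```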
Core game simulation loop¶
The sim_games function is the culmination of this project. It repeatedly simulates games between two user-given teams.
It takes the two teams as arguments (defaulting to North Carolina and Duke). It uses the get_probs and get_misc_stats functions to obtain the teams' necessary statistics, and combines these into one array.
Next, the requested number of games (the optional num_games argument, ten by default) is simulated. A game is simulated by calling the sim_poss function, alternating between the two teams, until the total number of possessions equals the sum of the two teams' average possession counts. The scores of each game are recorded, along with a win count for each team; a tied game is counted as a win for the second team. Once all games have been simulated, the score of each game is reported, as well as each team's average score and win count.
def sim_games(num_games = 10, team1 = "North Carolina", team2 = "Duke"):
    # grab team stats from helper functions
    probs = get_probs(team1, team2)
    misc = get_misc_stats(team1, team2)
    # these contain the possession result probabilities
    team1_probs = probs[0]
    team2_probs = probs[1]
    # these contain average possession count and free throw percentage
    team1_misc = misc[0]
    team2_misc = misc[1]
    # add the two teams' average possession counts to get the possession count for a simulated game
    max_poss = team1_misc[0] + team2_misc[0]
    # make one array with all of the teams' necessary stats
    both_team_probs = [team1_probs + [team1_misc[1]], team2_probs + [team2_misc[1]]]
    # used to keep track of the scores across multiple sims
    team1_scores = []
    team2_scores = []
    team1_wins = 0
    team2_wins = 0
    for i in range(num_games):
        scores = [0, 0]
        curr_poss = 0
        while curr_poss < max_poss:
            # alternate possessions: even counts go to team1, odd counts to team2
            team = curr_poss % 2
            scores[team] += sim_poss(both_team_probs[team])
            curr_poss += 1
        team1_scores.append(scores[0])
        team2_scores.append(scores[1])
        # note: a tied game is counted as a win for team2
        if scores[0] > scores[1]:
            team1_wins += 1
        else:
            team2_wins += 1
        print(f"Game {i+1:2.0f} {team1:>20} {scores[0]:3.0f} \t {scores[1]:3.0f} {team2:<20} ")
    print(f"\nAverage {team1:^30} score: {Average(team1_scores):.2f}" +
          f"\nAverage {team2:^30} score: {Average(team2_scores):.2f}")
    print(f"\n{team1:<30} wins: {team1_wins:2.0f}" +
          f"\n{team2:<30} wins: {team2_wins:2.0f}")
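As a rough sanity check on the simulated scores, the expected points from one sim_poss call can be worked out in closed form. This is a sketch under stated assumptions, not part of the original notebook: it ignores the small random adjustments applied by adj, and the helper name expected_points_per_possession is hypothetical. If `direct` is the expected score of a draw that ends the possession and `repeat` is the probability of a miss followed by an offensive rebound, then E = direct + repeat * E, so E = direct / (1 - repeat).

```python
def expected_points_per_possession(probs, ft_pct):
    """Closed-form expected score of one sim_poss call, ignoring adj() noise.

    probs uses the same index order as sim_poss:
    0 = miss, 1 = offensive rebound, 2 = two, 3 = three, 4 = turnover, 5 = foul.
    """
    miss, oreb, two, three, tov, foul = probs
    # random.choices normalises the five weights by their total
    total = miss + two + three + tov + foul
    # expected points from a single draw (two free throws are shot on a foul)
    direct = (2 * two + 3 * three + foul * 2 * ft_pct) / total
    # probability that the draw is a miss followed by an offensive rebound,
    # in which case the possession starts over
    repeat = (miss / total) * oreb
    # E = direct + repeat * E  =>  E = direct / (1 - repeat)
    return direct / (1 - repeat)

# North Carolina values copied from the example outputs above
unc_probs = [0.36136018257551633, 0.10115549297568453, 0.19492609017437149,
             0.07580382814234912, 0.09460779242619718, 0.17214661370588133]
unc_ft_pct = 0.7589073634204275

ppp = expected_points_per_possession(unc_probs, unc_ft_pct)
print(f"expected points per possession: {ppp:.3f}")
```

Multiplying this value by the number of possessions a team receives in sim_games (about half of max_poss) gives an approximate expected score to compare against the simulated averages.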
Simulated results¶
Simulating ten games between North Carolina and Butler
sim_games(team1="North Carolina", team2="Butler")
Game 1 North Carolina 103 89 Butler
Game 2 North Carolina 107 90 Butler
Game 3 North Carolina 108 101 Butler
Game 4 North Carolina 116 74 Butler
Game 5 North Carolina 75 89 Butler
Game 6 North Carolina 101 101 Butler
Game 7 North Carolina 84 83 Butler
Game 8 North Carolina 100 85 Butler
Game 9 North Carolina 99 102 Butler
Game 10 North Carolina 92 69 Butler

Average North Carolina score: 98.50
Average Butler score: 88.30

North Carolina wins: 7
Butler wins: 3
Simulating ten games between Duke and North Carolina
sim_games(team1="Duke", team2="North Carolina")
Game 1 Duke 87 107 North Carolina
Game 2 Duke 86 86 North Carolina
Game 3 Duke 99 88 North Carolina
Game 4 Duke 83 91 North Carolina
Game 5 Duke 78 84 North Carolina
Game 6 Duke 90 72 North Carolina
Game 7 Duke 110 98 North Carolina
Game 8 Duke 90 99 North Carolina
Game 9 Duke 101 94 North Carolina
Game 10 Duke 85 93 North Carolina

Average Duke score: 90.90
Average North Carolina score: 91.20

Duke wins: 4
North Carolina wins: 6
Simulating twenty games between Houston and Purdue
sim_games(20, "Houston", "Purdue")
Game 1 Houston 99 87 Purdue
Game 2 Houston 68 90 Purdue
Game 3 Houston 91 79 Purdue
Game 4 Houston 90 98 Purdue
Game 5 Houston 85 90 Purdue
Game 6 Houston 75 103 Purdue
Game 7 Houston 96 107 Purdue
Game 8 Houston 80 95 Purdue
Game 9 Houston 87 72 Purdue
Game 10 Houston 86 119 Purdue
Game 11 Houston 83 94 Purdue
Game 12 Houston 96 97 Purdue
Game 13 Houston 94 101 Purdue
Game 14 Houston 84 72 Purdue
Game 15 Houston 103 86 Purdue
Game 16 Houston 95 81 Purdue
Game 17 Houston 91 95 Purdue
Game 18 Houston 98 81 Purdue
Game 19 Houston 87 93 Purdue
Game 20 Houston 85 85 Purdue

Average Houston score: 88.65
Average Purdue score: 91.25

Houston wins: 7
Purdue wins: 13
Simulating twenty games between Purdue and UConn
sim_games(20, "Purdue", "UConn")
Game 1 Purdue 90 103 UConn
Game 2 Purdue 93 100 UConn
Game 3 Purdue 91 106 UConn
Game 4 Purdue 86 124 UConn
Game 5 Purdue 104 112 UConn
Game 6 Purdue 105 117 UConn
Game 7 Purdue 102 101 UConn
Game 8 Purdue 98 90 UConn
Game 9 Purdue 93 96 UConn
Game 10 Purdue 90 94 UConn
Game 11 Purdue 112 87 UConn
Game 12 Purdue 103 102 UConn
Game 13 Purdue 98 105 UConn
Game 14 Purdue 100 107 UConn
Game 15 Purdue 100 88 UConn
Game 16 Purdue 92 101 UConn
Game 17 Purdue 105 100 UConn
Game 18 Purdue 85 107 UConn
Game 19 Purdue 125 92 UConn
Game 20 Purdue 105 100 UConn

Average Purdue score: 98.85
Average UConn score: 101.60

Purdue wins: 8
UConn wins: 12
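Ten or twenty simulated games give only a rough estimate of how often each team wins. As a hedged illustration, the sketch below treats each simulated game as an independent win-or-loss trial and computes the binomial standard error of a win fraction. The helper name win_rate_se is hypothetical, and the 8-of-20 figure is Purdue's record from the run above.

```python
import math

def win_rate_se(wins, games):
    """Return (win fraction, binomial standard error) for wins out of games."""
    if games <= 0:
        raise ValueError("games must be positive")
    p = wins / games
    se = math.sqrt(p * (1 - p) / games)
    return p, se

# Purdue won 8 of the 20 simulated games against UConn
p, se = win_rate_se(8, 20)
print(f"win fraction {p:.2f} +/- {se:.2f}")
```

Because the standard error shrinks with the square root of the number of games, passing a larger num_games to sim_games gives a steadier estimate of each team's win rate.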